
    Q-Prop: Sample-efficient policy gradient with an off-policy critic

    Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains. However, a major obstacle facing deep RL in the real world is its high sample complexity. Batch policy gradient methods offer stable learning, but at the cost of high variance, which often requires large batches. TD-style methods, such as off-policy actor-critic and Q-learning, are more sample-efficient but biased, and often require costly hyperparameter sweeps to stabilize. In this work, we aim to develop methods that combine the stability of policy gradients with the efficiency of off-policy RL. We present Q-Prop, a policy gradient method that uses a Taylor expansion of the off-policy critic as a control variate. Q-Prop is both sample-efficient and stable, and effectively combines the benefits of on-policy and off-policy methods. We analyze the connection between Q-Prop and existing model-free algorithms, and use control variate theory to derive two variants of Q-Prop with conservative and aggressive adaptation. We show that conservative Q-Prop provides substantial gains in sample efficiency over trust region policy optimization (TRPO) with generalized advantage estimation (GAE), and improves stability over deep deterministic policy gradient (DDPG), the state-of-the-art on-policy and off-policy methods respectively, on OpenAI Gym's MuJoCo continuous control environments.
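
    As a sketch of the core idea, the first-order Taylor expansion of the critic around the deterministic policy mean gives an analytic control variate that is subtracted from the Monte Carlo advantage and added back through the critic's gradient. The snippet below is a minimal, hedged illustration of that centring step, not the authors' implementation; the scalar-action shapes, the per-sample covariance stand-in, and the function name are assumptions.

```python
# Minimal sketch of the Q-Prop centring step (conservative variant), assuming
# scalar actions and placeholder arrays; not the authors' implementation.
import numpy as np

def qprop_centred_advantage(adv, actions, mu, q_grad_at_mu, conservative=True):
    """adv          : Monte Carlo advantage estimates A_hat(s, a), shape (N,)
       actions      : sampled actions a, shape (N,)
       mu           : deterministic policy mean mu(s), shape (N,)
       q_grad_at_mu : dQ_w(s, a)/da evaluated at a = mu(s), shape (N,)"""
    # First-order Taylor expansion of the critic around mu(s):
    # A_bar(s, a) = grad_a Q_w(s, mu(s)) * (a - mu(s))
    a_bar = q_grad_at_mu * (actions - mu)
    # Single-sample stand-in for Cov(A_hat, A_bar): conservative Q-Prop keeps
    # the control variate only where it correlates positively with A_hat;
    # the aggressive variant uses the sign instead.
    if conservative:
        eta = (adv * a_bar > 0).astype(np.float64)
    else:
        eta = np.sign(adv * a_bar)
    # The likelihood-ratio gradient then uses (A_hat - eta * A_bar); the
    # analytic term eta * grad_a Q_w * grad_theta mu(s) is added back separately.
    return adv - eta * a_bar, eta
```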

    Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning

    Off-policy model-free deep reinforcement learning methods using previously collected data can improve sample efficiency over on-policy policy gradient techniques. On the other hand, on-policy algorithms are often more stable and easier to use. This paper examines, both theoretically and empirically, approaches to merging on- and off-policy updates for deep reinforcement learning. Theoretical results show that off-policy updates with a value function estimator can be interpolated with on-policy policy gradient updates whilst still satisfying performance bounds. Our analysis uses control variate methods to produce a family of policy gradient algorithms, with several recently proposed algorithms being special cases of this family. We then provide an empirical comparison of these techniques with the remaining algorithmic details fixed, and show how different mixing of off-policy gradient estimates with on-policy samples contributes to improvements in empirical performance. The final algorithm provides a generalization and unification of existing deep policy gradient techniques, has theoretical guarantees on the bias introduced by off-policy updates, and improves on state-of-the-art model-free deep RL methods on a number of OpenAI Gym continuous control benchmarks.
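
    A minimal sketch of the interpolation the paper analyzes: a likelihood-ratio (on-policy) gradient estimate is convexly combined with a critic-based (off-policy) estimate through a mixing coefficient, here called nu. The function, shapes, and names are illustrative assumptions, not the paper's exact estimator.

```python
# Minimal sketch of interpolating on- and off-policy gradient estimates with a
# mixing coefficient nu; shapes and names are illustrative assumptions.
import numpy as np

def interpolated_gradient(logp_grads, advantages, critic_grads, nu=0.2):
    """logp_grads   : grad_theta log pi(a|s) per sample, shape (N, D)
       advantages   : advantage estimates A_hat(s, a), shape (N,)
       critic_grads : critic-based off-policy gradient per sample, shape (N, D)
       nu           : mixing coefficient in [0, 1]."""
    assert 0.0 <= nu <= 1.0
    on_policy = (logp_grads * advantages[:, None]).mean(axis=0)  # likelihood-ratio term
    off_policy = critic_grads.mean(axis=0)                       # critic-based term
    # nu = 0 recovers the pure on-policy policy gradient;
    # nu = 1 recovers a pure off-policy actor-critic update.
    return (1.0 - nu) * on_policy + nu * off_policy
```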

    High fidelity progressive reinforcement learning for agile maneuvering UAVs

    In this work, we present a high-fidelity, model-based progressive reinforcement learning method for control system design for an agile maneuvering UAV. Our work relies on a simulation-based training and testing environment for software-in-the-loop (SIL), hardware-in-the-loop (HIL), and integrated flight testing within a photo-realistic virtual reality (VR) environment. Through progressive learning with high-fidelity agent and environment models, the guidance and control policies build agile maneuvering capability on top of fundamental control laws. First, we provide insight into the development of high-fidelity mathematical models using frequency-domain system identification. These models are then used to design reinforcement learning-based adaptive flight control laws that allow the vehicle to be controlled over a wide range of operating conditions, covering changes such as payload, battery voltage, and damage to actuators and electronic speed controllers (ESCs). Finally, we design the outer-loop flight guidance and control laws. Our current work and progress are summarized here.
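
    As a hedged illustration of the frequency-domain system identification step mentioned above, the sketch below fits a first-order transfer function G(s) = K/(tau*s + 1) to a synthetic frequency response by complex least squares. The data, model order, and parameter values are placeholder assumptions, not the paper's vehicle models.

```python
# Minimal sketch of frequency-domain system identification: fit a first-order
# transfer function G(s) = K / (tau*s + 1) to measured frequency-response data.
# The "measured" response below is synthetic, not data from the paper.
import numpy as np
from scipy.optimize import least_squares

w = np.logspace(-1, 2, 50)                     # excitation frequencies (rad/s)
true_K, true_tau = 2.0, 0.5
H_meas = true_K / (1j * w * true_tau + 1)      # stand-in for measured response
H_meas = H_meas + 0.01 * (np.random.randn(50) + 1j * np.random.randn(50))

def residual(p):
    K, tau = p
    H = K / (1j * w * tau + 1)
    err = H - H_meas
    return np.concatenate([err.real, err.imag])  # least_squares needs real residuals

fit = least_squares(residual, x0=[1.0, 1.0])
print("K = %.3f, tau = %.3f" % tuple(fit.x))
```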

    vrAIn: a deep learning approach tailoring computing and radio resources in virtualized RANs

    Proceedings of the 25th Annual International Conference on Mobile Computing and Networking (MobiCom '19), October 21-25, 2019, Los Cabos, Mexico. The virtualization of radio access networks (vRAN) is the latest milestone in the NFV revolution. However, the complex dependencies between computing and radio resources make vRAN resource control particularly daunting. We present vrAIn, a dynamic resource controller for vRANs based on deep reinforcement learning. First, we use an autoencoder to project high-dimensional context data (traffic and signal quality patterns) into a latent representation. Then, we use a deep deterministic policy gradient (DDPG) algorithm based on an actor-critic neural network structure and a classifier to map (encoded) contexts into resource control decisions. We have implemented vrAIn using an open-source LTE stack over different platforms. Our results show that vrAIn successfully derives appropriate compute and radio control actions irrespective of the platform and context: (i) it provides savings in computational capacity of up to 30% over CPU-unaware methods; (ii) it improves the probability of meeting QoS targets by 25% over static allocation policies using similar CPU resources on average; (iii) upon CPU capacity shortage, it improves throughput performance by 25% over state-of-the-art schemes; and (iv) it performs close to optimal policies derived from an offline oracle. To the best of our knowledge, this is the first work that thoroughly studies the computational behavior of vRANs, and the first approach to a model-free solution that does not need to assume any particular vRAN platform or system conditions.
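
    The pipeline described above (an autoencoder compressing high-dimensional context, feeding a DDPG-style actor that emits resource decisions) can be sketched structurally as follows. The layer sizes, dimensions, and variable names are illustrative assumptions rather than the published architecture.

```python
# Structural sketch of the vrAIn pipeline: an autoencoder compresses context
# (traffic / signal-quality patterns) into a latent vector, and a DDPG-style
# actor maps the latent context to CPU and radio resource decisions.
# All sizes below are assumptions, not the paper's configuration.
import torch
import torch.nn as nn

CTX_DIM, LATENT_DIM, ACTION_DIM = 256, 16, 2   # assumed dimensions

encoder = nn.Sequential(nn.Linear(CTX_DIM, 64), nn.ReLU(), nn.Linear(64, LATENT_DIM))
decoder = nn.Sequential(nn.Linear(LATENT_DIM, 64), nn.ReLU(), nn.Linear(64, CTX_DIM))

# Actor: latent context -> normalized (CPU share, radio scheduling) decisions
actor = nn.Sequential(nn.Linear(LATENT_DIM, 32), nn.ReLU(),
                      nn.Linear(32, ACTION_DIM), nn.Sigmoid())

ctx = torch.randn(8, CTX_DIM)                           # batch of context observations
z = encoder(ctx)                                        # latent representation
recon_loss = nn.functional.mse_loss(decoder(z), ctx)    # autoencoder objective
actions = actor(z.detach())                             # resource control decisions
```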

    Hemophilia gene therapy knowledge and perceptions: Results of an international survey

    Background: Hemophilia gene therapy is a rapidly evolving therapeutic approach in which a number of programs are approaching clinical development completion. Objective: The aim of this study was to evaluate the knowledge and perceptions of a variety of health care practitioners and scientists about gene therapy for hemophilia. Methods: This survey study was conducted February 1 to 18, 2019. Survey participants were members of the ISTH, European Hemophilia Consortium, European Hematology Association, or European Association for Haemophilia and Allied Disorders with valid email contacts. The online survey consisted of 36 questions covering demographic information, perceptions and knowledge of gene therapy for hemophilia, and educational preferences. Survey results were summarized using descriptive statistics. Results: Of the 5117 survey recipients, 201 responded from 55 countries (4% response rate). Most respondents (66%) were physicians, and 59% were physicians directly involved in the care of people with hemophilia. Among physician respondents directly involved in hemophilia care, 35% lacked the ability to explain the science of adeno-associated viral gene therapy for hemophilia, and 40% indicated limited ability or lack of comfort answering patient questions about gene therapy for hemophilia based on clinical trial results to date. Overall, 75% of survey respondents answered 10 single-answer knowledge questions correctly, 13% incorrectly, and 12% were unsure of the correct answers. Conclusions: This survey highlighted knowledge gaps and educational needs related to gene therapy for hemophilia and, along with other inputs, has informed the development of "Gene Therapy in Hemophilia: An ISTH Education Initiative."

    Mastering the game of Go without human knowledge

    A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains. Recently, AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play. Here we introduce an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, our new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.
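
    The training signal described here combines a value regression toward the self-play game outcome with a policy term that matches the network's move probabilities to the search-improved visit distribution. The sketch below implements that published combined loss in PyTorch as a hedged illustration; tensor shapes and the helper name are assumptions.

```python
# Minimal sketch of the AlphaGo Zero training objective: the network predicts
# move probabilities p and a value v, regressed toward the MCTS visit
# distribution pi and the self-play outcome z. Tensors are placeholders.
import torch
import torch.nn.functional as F

def alphago_zero_loss(p_logits, v, pi, z, params, c=1e-4):
    """p_logits : raw move logits from the network, shape (B, moves)
       v        : predicted game outcome in [-1, 1], shape (B,)
       pi       : MCTS-improved move distribution, shape (B, moves)
       z        : actual self-play outcome (+1 win / -1 loss), shape (B,)"""
    value_loss = F.mse_loss(v, z)                                       # (z - v)^2
    policy_loss = -(pi * F.log_softmax(p_logits, dim=1)).sum(1).mean()  # -pi^T log p
    l2 = c * sum((w ** 2).sum() for w in params)                        # weight decay
    return value_loss + policy_loss + l2
```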

    Developing a multivariable prediction model for functional outcome after reperfusion therapy for acute ischaemic stroke: study protocol for the Targeting Optimal Thrombolysis Outcomes (TOTO) multicentre cohort study.

    INTRODUCTION: Intravenous thrombolysis (IVT) with recombinant tissue plasminogen activator (rt-PA) is the only approved pharmacological reperfusion therapy for acute ischaemic stroke. Despite population-level benefit, IVT is not equally effective in all patients, nor is it without significant risk. Uncertain treatment outcome prediction complicates patient selection for treatment. This study will develop and validate predictive algorithms for IVT response, using clinical, radiological and blood-based biomarker measures. A secondary objective is to develop predictive algorithms for endovascular thrombectomy (EVT), which has been proven as an effective reperfusion therapy since study inception. METHODS AND ANALYSIS: The Targeting Optimal Thrombolysis Outcomes Study is a multicentre prospective cohort study of ischaemic stroke patients treated at participating Australian stroke centres with IVT and/or EVT. Patients undergo neuroimaging using multimodal CT or MRI at baseline, with repeat neuroimaging 24 hours post-treatment. Baseline and follow-up blood samples are provided for research use. The primary outcome is good functional outcome at 90 days post-stroke, defined as a modified Rankin Scale (mRS) score of 0-2. Secondary outcomes are reperfusion, recanalisation, infarct core growth, change in stroke severity, poor functional outcome, excellent functional outcome, and ordinal mRS at 90 days. Primary predictive models will be developed and validated in patients treated only with rt-PA. Models will be built using regression methods and will include clinical variables, radiological measures from multimodal neuroimaging, and blood-based biomarkers measured by mass spectrometry. Predictive accuracy will be quantified using c-statistics and R². In secondary analyses, models will be developed in patients treated using EVT, with or without prior IVT, reflecting practice changes since the original study design. ETHICS AND DISSEMINATION: Patients, or relatives when patients cannot consent, provide written informed consent to participate. This study received approval from the Hunter New England Local Health District Human Research Ethics Committee (reference 14/10/15/4.02). Findings will be disseminated via peer-reviewed publications and conference presentations.
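
    As a hedged sketch of the kind of model the protocol describes (a regression model for the dichotomised mRS 0-2 outcome, evaluated with the c-statistic), the snippet below fits a logistic regression on synthetic placeholder data; none of the variables correspond to actual study covariates.

```python
# Minimal sketch: logistic regression for good functional outcome (mRS 0-2)
# evaluated with the c-statistic (ROC AUC). Data are synthetic placeholders,
# not study variables.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))     # stand-ins for clinical/imaging covariates
y = (X @ np.array([0.8, -1.2, -0.6, 0.3, 0.0]) + rng.normal(size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)
c_stat = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])  # c-statistic
print(f"c-statistic: {c_stat:.3f}")
```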